Downloading Texts from Project Gutenberg using R

Author

Martin Schweinberger

Introduction

This how-to guide shows how to download, inspect, and clean texts from the Project Gutenberg archive using R. Project Gutenberg is one of the oldest and largest freely available digital libraries, containing over 70,000 ebooks whose US copyright has expired. It is an invaluable resource for researchers in literary studies, corpus linguistics, computational humanities, and any field requiring access to large amounts of digitised historical and literary text.

The R package gutenbergr provides convenient programmatic access to the Project Gutenberg catalogue, allowing you to search, filter, and download texts directly into your R session without manual downloading or file management.

Before You Start

This guide assumes basic familiarity with R. If you are new to R, please work through an introductory R tutorial first.

What This Guide Covers
  1. Setup — installing and loading required packages
  2. A robust download function — handling mirror failures automatically
  3. Exploring the catalogue — browsing and searching available texts
  4. Filtering by author, language, subject, and rights
  5. Downloading individual texts
  6. Downloading multiple texts simultaneously
  7. Cleaning and preparing downloaded texts — removing boilerplate, splitting into sections, and saving for analysis
  8. Troubleshooting — encoding issues and texts not found

Citation

Schweinberger, Martin. 2026. Downloading Texts from Project Gutenberg using R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/gutenberg/gutenberg.html (Version 2026.02.24).


Setup

Section Overview

What you’ll learn: How to install and load the packages needed for this guide

Installing Packages

Code
# Install required packages — run once, then comment out
install.packages("gutenbergr")   # access to Project Gutenberg catalogue and downloads
install.packages("dplyr")        # data manipulation (filter, select, mutate)
install.packages("stringr")      # string processing (cleaning text)
install.packages("tidyr")        # reshaping data
install.packages("purrr")        # iteration (downloading multiple texts)
install.packages("ggplot2")      # visualisation
install.packages("flextable")    # formatted tables
install.packages("DT")           # interactive data tables
install.packages("here")         # portable file paths

Loading Packages

Code
# Load packages — run at the start of every session
library(gutenbergr)   # Project Gutenberg interface
library(dplyr)        # data manipulation
library(stringr)      # string processing
library(tidyr)        # data reshaping
library(purrr)        # iteration over multiple IDs
library(ggplot2)      # plotting
library(flextable)    # formatted tables
library(DT)           # interactive HTML tables
library(here)         # portable file paths
Why Not library(tidyverse)?

Loading individual packages (dplyr, stringr, etc.) is preferable to library(tidyverse) for reproducibility: it makes dependencies explicit, avoids namespace conflicts, and ensures your code works even if the Tidyverse bundle changes. LADAL tutorials follow this best practice throughout.
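As a concrete illustration of the namespace point, base R and dplyr both define a function called filter(); the explicit prefix removes any ambiguity about which one runs. A minimal sketch using the built-in mtcars data:

```r
# stats::filter() (base R) and dplyr::filter() share a name; whichever
# package is attached last masks the other. Explicit prefixes are unambiguous:
library(dplyr)
dplyr::filter(mtcars, cyl == 6)    # row filtering from dplyr
stats::filter(1:10, rep(1/3, 3))   # moving-average filter from base R
```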


A Robust Download Function

Section Overview

What you’ll learn: Why direct gutenberg_download() calls sometimes return empty results, and how to define a single reliable helper function that all subsequent downloads use

Project Gutenberg’s servers and mirrors can be unreliable — a direct gutenberg_download() call may silently return zero lines even when the ID is correct. The most robust approach is to:

  1. Try several mirrors in sequence via gutenbergr
  2. Fall back to reading the raw plain-text file directly from the Project Gutenberg cache URL, which is always at https://www.gutenberg.org/cache/epub/{ID}/pg{ID}.txt

We define this logic once as a helper function and use it throughout the guide:

Code
# Helper function: download a single text by Gutenberg ID
# Tries gutenbergr mirrors first; falls back to direct URL read if all fail
# Arguments:
#   id          : integer gutenberg_id
#   meta_fields : character vector of metadata columns to attach (passed to gutenberg_download)
#   title_fallback : title string to use in the fallback data frame
gutenberg_safe <- function(id, meta_fields = "title", title_fallback = NA_character_) {

  # List of mirrors to try in order
  mirrors <- c(
    "http://mirrors.xmission.com/gutenberg/",
    "http://gutenberg.pglaf.org/",
    "https://gutenberg.readingroo.ms/",
    "http://gutenberg.nabasny.com/"
  )

  result <- NULL

  # Step 1: try each mirror via gutenbergr
  for (m in mirrors) {
    tryCatch({
      dl <- gutenberg_download(id, meta_fields = meta_fields, mirror = m)
      if (!is.null(dl) && nrow(dl) > 0) {
        message("Downloaded ID ", id, " via mirror: ", m)
        result <- dl
        break
      }
    }, error   = function(e) NULL,
       warning = function(w) NULL)
  }

  # Step 2: fall back to direct cache URL if all mirrors failed
  if (is.null(result) || nrow(result) == 0) {
    message("All mirrors failed for ID ", id, " — trying direct cache URL")
    cache_url <- paste0("https://www.gutenberg.org/cache/epub/", id, "/pg", id, ".txt")
    tryCatch({
      lines <- readLines(url(cache_url), warn = FALSE, encoding = "UTF-8")
      # Look up title from metadata if not supplied
      if (is.na(title_fallback)) {
        title_fallback <- gutenberg_metadata |>
          dplyr::filter(gutenberg_id == id) |>
          dplyr::pull(title) |>
          dplyr::first()
      }
      result <- data.frame(
        gutenberg_id = id,
        text         = lines,
        title        = title_fallback,
        stringsAsFactors = FALSE
      )
      message("Downloaded ID ", id, " via direct cache URL (", nrow(result), " lines)")
    }, error = function(e) {
      stop("Could not download ID ", id, ": ", conditionMessage(e))
    })
  }

  result
}
Why a Helper Function?

Defining gutenberg_safe() once and calling it throughout means:

  • Every download in this guide uses the same robust fallback logic
  • If Project Gutenberg updates its mirror list, you only need to update one place
  • The function is self-documenting — the mirrors and fallback URL are visible in one location
  • You can copy gutenberg_safe() directly into your own projects
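As a minimal usage sketch (assuming the function definition above has been run and an internet connection is available; ID 11 is Alice's Adventures in Wonderland):

```r
# Download a single text with the helper and inspect the result
alice <- gutenberg_safe(11, meta_fields = c("title", "author"))
head(alice$text)
nrow(alice)
```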

Exploring the Project Gutenberg Catalogue

Section Overview

What you’ll learn: How to browse and search the full Project Gutenberg catalogue, and what metadata fields are available for filtering

The Metadata Table

The gutenbergr package ships with a metadata table — gutenberg_metadata — that contains information about every text in the Project Gutenberg archive. You can inspect it directly without downloading anything:

Code
# Load the full metadata table
# This is a local data frame included with the gutenbergr package
overview <- gutenberg_metadata

# How many texts are available?
cat("Total texts in catalogue:", nrow(overview), "\n")
Total texts in catalogue: 72569 
Code
cat("Metadata columns:", ncol(overview), "\n")
Metadata columns: 8 
Code
cat("Column names:", paste(names(overview), collapse = ", "), "\n")
Column names: gutenberg_id, title, author, gutenberg_author_id, language, gutenberg_bookshelf, rights, has_text 

The metadata table contains the following key fields:

  • gutenberg_id: unique numeric identifier for each text
  • title: title of the work
  • author: author name in 'Surname, Firstname' format
  • gutenberg_author_id: unique identifier for the author (useful for finding all works by one author)
  • language: ISO 639 language code (e.g. 'en', 'de', 'fr')
  • gutenberg_bookshelf: thematic bookshelf category (e.g. 'Science Fiction', 'History')
  • rights: copyright status (typically 'Public domain in the USA.')
  • has_text: whether a plain text version is available for download
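The gutenberg_author_id field is handy when the exact author name formatting is uncertain: look the ID up once, then filter by it. A sketch (assumes gutenbergr and dplyr are loaded as above, and uses Jane Austen purely as an example):

```r
# Resolve an author's numeric ID from the catalogue, then fetch all works by ID
austen_id <- gutenberg_metadata |>
  dplyr::filter(author == "Austen, Jane") |>
  dplyr::pull(gutenberg_author_id) |>
  unique()

gutenberg_works(gutenberg_author_id == austen_id) |>
  dplyr::select(gutenberg_id, title)
```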

Browsing with gutenberg_works()

The gutenberg_works() function is a convenience wrapper around gutenberg_metadata that returns only public-domain texts with a downloadable plain text version (has_text == TRUE):

Code
# Browse all available public domain texts with plain text versions
all_works <- gutenberg_works()
cat("Texts available via gutenberg_works():", nrow(all_works), "\n")

Filtering the Catalogue

Section Overview

What you’ll learn: How to filter the Project Gutenberg catalogue by author, language, subject/bookshelf, and multiple criteria to find exactly the texts you need

Filter by Author

Author names in the catalogue are stored in “Surname, Firstname” format:

Code
# Find all works by Charles Darwin using exact name format
darwin_works <- gutenberg_works(author == "Darwin, Charles")
cat("Works by Charles Darwin:", nrow(darwin_works), "\n")
Works by Charles Darwin: 31 

When unsure of the exact name format, use str_detect() for partial matching:

Code
# Partial name search — more robust than exact matching
austen_works <- gutenberg_works(
  stringr::str_detect(author, "Austen")
)
cat("Works matching 'Austen':", nrow(austen_works), "\n")
Works matching 'Austen': 16 
Code
austen_works |> dplyr::select(gutenberg_id, title, author)
# A tibble: 16 × 3
   gutenberg_id title                                                     author
          <int> <chr>                                                     <chr> 
 1          105 "Persuasion"                                              Auste…
 2          121 "Northanger Abbey"                                        Auste…
 3          141 "Mansfield Park"                                          Auste…
 4          158 "Emma"                                                    Auste…
 5          946 "Lady Susan"                                              Auste…
 6         1212 "Love and Freindship [sic]"                               Auste…
 7         1342 "Pride and Prejudice"                                     Auste…
 8        17797 "Memoir of Jane Austen"                                   Auste…
 9        21839 "Sense and Sensibility"                                   Auste…
10        22536 "Jane Austen, Her Life and Letters: A Family Record"      Auste…
11        22536 "Jane Austen, Her Life and Letters: A Family Record"      Auste…
12        31100 "The Complete Project Gutenberg Works of Jane Austen\nA … Auste…
13        33513 "The Frightened Planet"                                   Auste…
14        37431 "Pride and Prejudice, a play founded on Jane Austen's no… Auste…
15        39897 "Discoveries Among the Ruins of Nineveh and Babylon"      Layar…
16        42078 "The Letters of Jane Austen\r\nSelected from the compila… Auste…

Filter by Language

The language field uses ISO 639-1 two-letter codes:

Code
# Count German-language texts available
gutenberg_works(
  languages     = "de",
  all_languages = TRUE
) |>
  dplyr::count(language, sort = TRUE)
# A tibble: 1 × 2
  language     n
  <chr>    <int>
1 de        1296
Common Language Codes
  en  English       de  German
  fr  French        it  Italian
  es  Spanish       nl  Dutch
  pt  Portuguese    la  Latin
  fi  Finnish       zh  Chinese

For a full list, see the ISO 639-1 standard.

Code
# Count texts per language across the full catalogue
lang_counts <- gutenberg_metadata |>
  dplyr::filter(has_text == TRUE) |>
  dplyr::count(language, sort = TRUE) |>
  dplyr::filter(!is.na(language)) |>
  head(15)
Code
lang_counts |>
  dplyr::mutate(language = reorder(language, n)) |>
  ggplot(aes(x = language, y = n)) +
  geom_col(fill = "steelblue", width = 0.7) +
  coord_flip() +
  labs(
    title    = "Project Gutenberg: Texts by Language",
    subtitle = "Top 15 languages (texts with downloadable plain text only)",
    x        = "Language (ISO 639-1)",
    y        = "Number of texts"
  ) +
  theme_bw() +
  theme(panel.grid.minor = element_blank())

Filter by Subject / Bookshelf

Project Gutenberg organises texts into thematic “bookshelves”:

Code
# Find all texts on the Science Fiction bookshelf
scifi <- gutenberg_works(
  stringr::str_detect(gutenberg_bookshelf, "Science Fiction")
)
cat("Science Fiction texts:", nrow(scifi), "\n")
Science Fiction texts: 1306 
Code
scifi |> dplyr::select(gutenberg_id, title, author) |> head(10)
# A tibble: 10 × 3
   gutenberg_id title                                       author              
          <int> <chr>                                       <chr>               
 1           36 The War of the Worlds                       Wells, H. G. (Herbe…
 2           42 The Strange Case of Dr. Jekyll and Mr. Hyde Stevenson, Robert L…
 3           62 A Princess of Mars                          Burroughs, Edgar Ri…
 4           64 The Gods of Mars                            Burroughs, Edgar Ri…
 5           68 The warlord of Mars                         Burroughs, Edgar Ri…
 6           72 Thuvia, Maid of Mars                        Burroughs, Edgar Ri…
 7           86 A Connecticut Yankee in King Arthur's Court Twain, Mark         
 8           96 The Monster Men                             Burroughs, Edgar Ri…
 9           97 Flatland: A Romance of Many Dimensions      Abbott, Edwin Abbott
10          123 At the Earth's Core                         Burroughs, Edgar Ri…
Code
# Browse the top 20 most populated bookshelves
gutenberg_metadata |>
  dplyr::filter(!is.na(gutenberg_bookshelf), has_text == TRUE) |>
  tidyr::separate_rows(gutenberg_bookshelf, sep = "/") |>
  dplyr::mutate(gutenberg_bookshelf = stringr::str_trim(gutenberg_bookshelf)) |>
  dplyr::count(gutenberg_bookshelf, sort = TRUE) |>
  head(20)
# A tibble: 20 × 2
   gutenberg_bookshelf                                        n
   <chr>                                                  <int>
 1 ""                                                     39230
 2 "Science Fiction"                                       1323
 3 "FR Littérature"                                         666
 4 "Children's Book Series"                                 494
 5 "Punch"                                                  493
 6 "Bestsellers, American, 1895-1923"                       394
 7 "World War I"                                            386
 8 "Historical Fiction"                                     340
 9 "US Civil War"                                           337
10 "Children's Fiction"                                     329
11 "Animal"                                                 303
12 "DE Prosa"                                               295
13 "Children's Literature"                                  269
14 "Technology"                                             236
15 "L'Illustration"                                         220
16 "Notes and Queries"                                      217
17 "The Mirror of Literature, Amusement, and Instruction"   202
18 "Christianity"                                           178
19 "Children's Picture Books"                               177
20 "United Kingdom"                                         170

Filter by Multiple Criteria

Combine conditions to narrow the catalogue precisely:

Code
# English-language science texts
english_science <- gutenberg_works(
  language == "en",
  stringr::str_detect(gutenberg_bookshelf, "(?i)science|natural|biology|astronomy")
)
cat("English science texts:", nrow(english_science), "\n")
English science texts: 1445 
Code
english_science |>
  dplyr::select(gutenberg_id, title, author, gutenberg_bookshelf) |>
  head(10)
# A tibble: 10 × 4
   gutenberg_id title                                 author gutenberg_bookshelf
          <int> <chr>                                 <chr>  <chr>              
 1           36 The War of the Worlds                 Wells… Movie Books/Scienc…
 2           42 The Strange Case of Dr. Jekyll and M… Steve… Precursors of Scie…
 3           62 A Princess of Mars                    Burro… Best Books Ever Li…
 4           64 The Gods of Mars                      Burro… Science Fiction    
 5           68 The warlord of Mars                   Burro… Science Fiction    
 6           72 Thuvia, Maid of Mars                  Burro… Science Fiction    
 7           86 A Connecticut Yankee in King Arthur'… Twain… Precursors of Scie…
 8           96 The Monster Men                       Burro… Science Fiction    
 9           97 Flatland: A Romance of Many Dimensio… Abbot… Science Fiction/Ma…
10          123 At the Earth's Core                   Burro… Science Fiction    

Downloading Individual Texts

Section Overview

What you’ll learn: How to download a single text by ID using gutenberg_safe(), and what the downloaded data looks like

Always Use the Gutenberg ID

Every text has a unique numeric ID visible in its Project Gutenberg URL (e.g., gutenberg.org/ebooks/1513). Downloading by ID is more reliable than searching by title, which can match multiple entries. Use gutenberg_works() or browse gutenberg.org to look up IDs before downloading.
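Looking up an ID from within R is straightforward; a sketch (any distinctive title fragment works, assuming gutenbergr, dplyr, and stringr are loaded as above):

```r
# Find the Gutenberg ID for a known title before downloading
gutenberg_works(stringr::str_detect(title, "Romeo and Juliet")) |>
  dplyr::select(gutenberg_id, title, author)
```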

Download Romeo and Juliet (ID: 1513)

Code
# Download Romeo and Juliet using gutenberg_safe()
# gutenberg_safe() tries multiple mirrors, then falls back to the direct cache URL
romeo <- gutenberg_safe(1513)

cat("Downloaded:", nrow(romeo), "lines\n")
Downloaded: 5647 lines
Code
cat("Columns:", paste(names(romeo), collapse = ", "), "\n")
Columns: gutenberg_id, text, title 

gutenberg_id   text                                                                      title
        1513   The Project Gutenberg eBook of Romeo and Juliet                           Romeo and Juliet
        1513                                                                             Romeo and Juliet
        1513   This ebook is for the use of anyone anywhere in the United States and     Romeo and Juliet
        1513   most other parts of the world at no cost and with almost no restrictions  Romeo and Juliet
        1513   whatsoever. You may copy it, give it away or re-use it under the terms    Romeo and Juliet
        1513   of the Project Gutenberg License included with this ebook or online       Romeo and Juliet
        1513   at www.gutenberg.org. If you are not located in the United States,        Romeo and Juliet
        1513   you will have to check the laws of the country where you are located      Romeo and Juliet
        1513   before using this eBook.                                                  Romeo and Juliet
        1513                                                                             Romeo and Juliet
        1513   Title: Romeo and Juliet                                                   Romeo and Juliet
        1513                                                                             Romeo and Juliet
        1513   Author: William Shakespeare                                               Romeo and Juliet
        1513                                                                             Romeo and Juliet
        1513   Release date: November 1, 1998 [eBook #1513]                              Romeo and Juliet

Download with Additional Metadata

The meta_fields argument attaches metadata columns to the downloaded text — useful when combining multiple texts into a corpus:

Code
# Download On the Origin of Species with title, author, and language attached
origin_species <- gutenberg_safe(
  1228,                                          # On the Origin of Species
  meta_fields = c("title", "author", "language")
)

cat("Title:",    unique(origin_species$title), "\n")
Title: On the Origin of Species By Means of Natural Selection
Or, the Preservation of Favoured Races in the Struggle for Life 
Code
cat("Author:",   unique(origin_species$author), "\n")
Author: 
Code
cat("Language:", unique(origin_species$language), "\n")
Language: 
Code
cat("Lines:",    nrow(origin_species), "\n")
Lines: 16570 

Author and language are empty here because this text came in via the direct cache URL fallback, which only attaches the title. To recover the missing metadata, join gutenberg_metadata back in by gutenberg_id (the multi-author corpus example below does exactly this).

Downloading Multiple Texts

Section Overview

What you’ll learn: How to download several texts at once and organise them into a labelled corpus ready for analysis

Downloading by ID Vector

To download multiple texts, call gutenberg_safe() for each ID and combine the results with dplyr::bind_rows():

Code
# Download Wuthering Heights (768) and Jane Eyre (1260)
# Call gutenberg_safe() for each ID, then stack the results
bronte_texts <- dplyr::bind_rows(
  gutenberg_safe(768),    # Wuthering Heights — Emily Brontë
  gutenberg_safe(1260)    # Jane Eyre — Charlotte Brontë
)

# How many lines from each text?
bronte_texts |>
  dplyr::count(title, name = "lines")
# A tibble: 2 × 2
  title                       lines
  <chr>                       <int>
1 Jane Eyre: An Autobiography 21381
2 Wuthering Heights           12342

Downloading All Works by an Author

Retrieve all IDs for an author from the catalogue, then loop through them:

Code
# Find all Charles Dickens IDs
dickens_ids <- gutenberg_works(
  author   == "Dickens, Charles",
  language == "en"
) |>
  dplyr::pull(gutenberg_id)

cat("Dickens texts available:", length(dickens_ids), "\n")
Dickens texts available: 54 
Code
cat("First 10 IDs:", paste(head(dickens_ids, 10), collapse = ", "), "\n")
First 10 IDs: 46, 564, 580, 699, 700, 730, 766, 821, 917, 963 
Code
# Download all Dickens texts — this may take several minutes
# purrr::map_dfr() loops over each ID and stacks the results
dickens_corpus <- purrr::map_dfr(
  dickens_ids,
  ~ gutenberg_safe(.x, meta_fields = c("title", "author"))
)

cat("Total lines:", nrow(dickens_corpus), "\n")
cat("Texts downloaded:", length(unique(dickens_corpus$title)), "\n")
Large Downloads

Downloading many texts at once can take several minutes. Best practices:

  • Save immediately after downloading (see the Saving section below) to avoid re-downloading
  • Download in batches if fetching more than ~20 texts
  • Be respectful of Project Gutenberg’s resources — it is a non-profit volunteer project
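The batching advice can be sketched as follows (illustrative file names; assumes gutenberg_safe() and dickens_ids defined above, plus the purrr and here packages):

```r
# Download in batches of 10, saving each batch before starting the next,
# so a failure part-way through never loses completed work
batches <- split(dickens_ids, ceiling(seq_along(dickens_ids) / 10))
for (i in seq_along(batches)) {
  batch <- purrr::map_dfr(batches[[i]],
                          ~ gutenberg_safe(.x, meta_fields = "title"))
  saveRDS(batch, here::here("data", paste0("dickens_batch_", i, ".rds")))
  message("Saved batch ", i, " of ", length(batches))
}
```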

Building a Multi-Author Corpus

Code
# Download three 19th-century texts for comparative analysis:
# Moby Dick (2701), Pride and Prejudice (1342), On the Origin of Species (1228)
comparison_corpus <- dplyr::bind_rows(
  gutenberg_safe(2701, meta_fields = c("title", "author")),  # Moby Dick
  gutenberg_safe(1342, meta_fields = c("title", "author")),  # Pride and Prejudice
  gutenberg_safe(1228, meta_fields = c("title", "author"))   # On the Origin of Species
) |>
  # If the author column is missing (fallback download), add it from metadata
  (\(df) {
    if (!"author" %in% names(df)) {
      df <- df |>
        dplyr::left_join(
          gutenberg_metadata |> dplyr::select(gutenberg_id, author),
          by = "gutenberg_id"
        )
    }
    df
  })()

# Corpus summary
comparison_corpus |>
  dplyr::group_by(author, title) |>
  dplyr::summarise(
    lines = dplyr::n(),
    words = sum(stringr::str_count(text, "\\S+"), na.rm = TRUE),
    .groups = "drop"
  )
# A tibble: 3 × 4
  author           title                                            lines  words
  <chr>            <chr>                                            <int>  <int>
1 Austen, Jane     "Pride and Prejudice"                            14911 130410
2 Darwin, Charles  "On the Origin of Species By Means of Natural S… 16570 158589
3 Melville, Herman "Moby Dick; Or, The Whale"                       22310 215840



Cleaning and Preparing Downloaded Texts

Section Overview

What you’ll learn: How to remove Project Gutenberg boilerplate, collapse lines into continuous text, split into chapters or acts, and save cleaned texts for analysis

Why this matters: Raw downloads include licence notices and formatting artefacts that distort frequency analysis, topic models, and other quantitative methods if not removed.

What Raw Downloads Look Like

Each download is a line-by-line data frame. The first and last portions contain boilerplate licence text:

Code
# Inspect the opening lines — boilerplate header visible here
head(romeo$text, 30)
 [1] "The Project Gutenberg eBook of Romeo and Juliet"                                      
 [2] "    "                                                                                 
 [3] "This ebook is for the use of anyone anywhere in the United States and"                
 [4] "most other parts of the world at no cost and with almost no restrictions"             
 [5] "whatsoever. You may copy it, give it away or re-use it under the terms"               
 [6] "of the Project Gutenberg License included with this ebook or online"                  
 [7] "at www.gutenberg.org. If you are not located in the United States,"                   
 [8] "you will have to check the laws of the country where you are located"                 
 [9] "before using this eBook."                                                             
[10] ""                                                                                     
[11] "Title: Romeo and Juliet"                                                              
[12] ""                                                                                     
[13] "Author: William Shakespeare"                                                          
[14] ""                                                                                     
[15] "Release date: November 1, 1998 [eBook #1513]"                                         
[16] "                Most recently updated: September 18, 2025"                            
[17] ""                                                                                     
[18] "Language: English"                                                                    
[19] ""                                                                                     
[20] "Credits: the PG Shakespeare Team, a team of about twenty Project Gutenberg volunteers"
[21] ""                                                                                     
[22] ""                                                                                     
[23] "*** START OF THE PROJECT GUTENBERG EBOOK ROMEO AND JULIET ***"                        
[24] ""                                                                                     
[25] ""                                                                                     
[26] ""                                                                                     
[27] ""                                                                                     
[28] "THE TRAGEDY OF ROMEO AND JULIET"                                                      
[29] ""                                                                                     
[30] "by William Shakespeare"                                                               
Code
# Inspect the closing lines — boilerplate footer visible here
tail(romeo$text, 20)
 [1] "Gutenberg™ concept of a library of electronic works that could be"    
 [2] "freely shared with anyone. For forty years, he produced and"          
 [3] "distributed Project Gutenberg™ eBooks with only a loose network of"   
 [4] "volunteer support."                                                   
 [5] ""                                                                     
 [6] "Project Gutenberg™ eBooks are often created from several printed"     
 [7] "editions, all of which are confirmed as not protected by copyright in"
 [8] "the U.S. unless a copyright notice is included. Thus, we do not"      
 [9] "necessarily keep eBooks in compliance with any particular paper"      
[10] "edition."                                                             
[11] ""                                                                     
[12] "Most people start at our website which has the main PG search"        
[13] "facility: www.gutenberg.org."                                         
[14] ""                                                                     
[15] "This website includes information about Project Gutenberg™,"          
[16] "including how to make donations to the Project Gutenberg Literary"    
[17] "Archive Foundation, how to help produce our new eBooks, and how to"   
[18] "subscribe to our email newsletter to hear about new eBooks."          
[19] ""                                                                     
[20] ""                                                                     

Removing Boilerplate

Project Gutenberg wraps every text in consistent *** START OF and *** END OF boundary markers. Note that gutenberg_download() strips this boilerplate automatically (strip = TRUE by default), so the markers usually survive only in texts fetched via the direct cache URL fallback, as with our romeo download:

Code
# Find the start and end marker line positions
start_marker <- which(stringr::str_detect(romeo$text, "\\*\\*\\* START OF"))
end_marker   <- which(stringr::str_detect(romeo$text, "\\*\\*\\* END OF"))

cat("START marker at line:", start_marker, "\n")
START marker at line: 23 
Code
cat("END marker at line:",   end_marker, "\n")
END marker at line: 5297 
Code
# Keep only lines between the two markers
romeo_clean <- romeo |>
  dplyr::slice((start_marker + 1):(end_marker - 1)) |>
  dplyr::filter(!is.na(text))

cat("Lines after boilerplate removal:", nrow(romeo_clean),
    "(removed", nrow(romeo) - nrow(romeo_clean), ")\n")
Lines after boilerplate removal: 5273 (removed 374 )

Removing Empty Lines

Code
# Remove lines that are empty or contain only whitespace
romeo_clean <- romeo_clean |>
  dplyr::filter(stringr::str_trim(text) != "")

cat("Lines after removing empty lines:", nrow(romeo_clean), "\n")
Lines after removing empty lines: 4137 

Collapsing to a Single String

Code
# Join all lines into one continuous string, then normalise whitespace
romeo_text <- romeo_clean$text |>
  paste(collapse = " ") |>
  stringr::str_squish()

cat("Total characters:", nchar(romeo_text), "\n")
Total characters: 141104 
Code
cat("First 300 characters:\n", substr(romeo_text, 1, 300), "\n")
First 300 characters:
 THE TRAGEDY OF ROMEO AND JULIET by William Shakespeare Contents THE PROLOGUE. ACT I Scene I. A public place. Scene II. A Street. Scene III. Room in Capulet’s House. Scene IV. A Street. Scene V. A Hall in Capulet’s House. ACT II CHORUS. Scene I. An open place adjoining Capulet’s Garden. Scene II. Cap 

Splitting into Acts and Scenes

Code
# Split Romeo and Juliet into Acts using a regex on Roman numeral headings
acts <- romeo_text |>
  stringr::str_replace_all("(ACT [IVX]+\\.?)", "|||\\1") |>  # insert split marker
  stringr::str_split("\\|\\|\\|") |>
  unlist() |>
  (\(x) x[nchar(stringr::str_trim(x)) > 20])()   # drop very short fragments

cat("Segments found:", length(acts), "\n")
Segments found: 11 
Code
cat("Segment 2 begins:", substr(acts[2], 1, 120), "\n")
Segment 2 begins: ACT I Scene I. A public place. Scene II. A Street. Scene III. Room in Capulet’s House. Scene IV. A Street. Scene V. A Ha 

Eleven segments are found rather than five because the regex also matches the ACT headings listed in the table of contents (segment 2 above is the Contents listing, not the play text). For a clean five-act split, remove the Contents block first or discard the extra leading segments.

Splitting into Chapters

Code
# Clean Wuthering Heights from the bronte_texts corpus
wuthering <- bronte_texts |>
  dplyr::filter(stringr::str_detect(title, "Wuthering"))

# Diagnostic: check what the opening lines look like
# (useful for seeing the exact marker format used)
cat("First 5 lines:\n")
First 5 lines:
Code
cat(head(wuthering$text, 5), sep = "\n")
Wuthering Heights

by Emily Brontë
Code
# Find boilerplate markers — try several common variants
wh_start <- which(stringr::str_detect(
  wuthering$text,
  stringr::regex("\\*{3}\\s*START OF", ignore_case = TRUE)
))
wh_end <- which(stringr::str_detect(
  wuthering$text,
  stringr::regex("\\*{3}\\s*END OF", ignore_case = TRUE)
))

# If markers not found, use the full text with no trimming
if (length(wh_start) == 0) {
  cat("START marker not found — using full text\n")
  wh_start <- 0L
}
START marker not found — using full text
Code
if (length(wh_end) == 0) {
  cat("END marker not found — using full text\n")
  wh_end <- nrow(wuthering) + 1L
}
END marker not found — using full text
Code
# Slice between markers (or use full text if markers absent)
wh_text <- wuthering |>
  dplyr::slice((wh_start[1] + 1):(wh_end[1] - 1)) |>
  dplyr::filter(stringr::str_trim(text) != "") |>
  dplyr::pull(text) |>
  paste(collapse = " ") |>
  stringr::str_squish()

cat("Characters in cleaned text:", nchar(wh_text), "\n")
Characters in cleaned text: 643482 
Code
# Split on CHAPTER headings (Roman or Arabic numerals)
wh_chapters <- wh_text |>
  stringr::str_replace_all("(CHAPTER\\s+[IVXLCDM0-9]+\\.?)", "|||\\1") |>
  stringr::str_split("\\|\\|\\|") |>
  unlist() |>
  (\(x) x[nchar(stringr::str_trim(x)) > 50])()

cat("Chapters found:", length(wh_chapters), "\n")
Chapters found: 34 
Code
cat("Second segment begins:", substr(wh_chapters[2], 1, 150), "\n")
Second segment begins: CHAPTER II Yesterday afternoon set in misty and cold. I had half a mind to spend it by my study fire, instead of wading through heath and mud to Wuthe 

Saving Cleaned Texts

Save downloaded and cleaned data immediately to avoid re-downloading in future sessions:

Code
# Create data directory if needed
if (!dir.exists(here::here("data"))) {
  dir.create(here::here("data"), recursive = TRUE)
}

# Save as RDS (R's native binary format — fast and lossless)
saveRDS(romeo_text,        here::here("data", "romeo_clean.rds"))
saveRDS(wh_chapters,       here::here("data", "wh_chapters.rds"))
saveRDS(comparison_corpus, here::here("data", "comparison_corpus.rds"))

# Save as plain text for use outside R
writeLines(romeo_text, here::here("data", "romeo_clean.txt"))

cat("Saved to:", here::here("data"), "\n")
Code
# Load saved data in future sessions — no re-downloading needed
romeo_text     <- readRDS(here::here("data", "romeo_clean.rds"))
wh_chapters    <- readRDS(here::here("data", "wh_chapters.rds"))
comparison_corpus <- readRDS(here::here("data", "comparison_corpus.rds"))
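The save/load pair above can be folded into a simple caching pattern: read the saved object if it exists, otherwise build it and save it. This is a sketch; `load_or_make()` is a hypothetical helper, not part of gutenbergr or here.

Code
```r
# Hypothetical caching helper: read an RDS if present, else build and save it
load_or_make <- function(path, make) {
  if (file.exists(path)) {
    readRDS(path)
  } else {
    obj <- make()
    saveRDS(obj, path)
    obj
  }
}

# Usage sketch: the builder function is whatever produced the object originally
# romeo_text <- load_or_make(here::here("data", "romeo_clean.rds"),
#                            function() { ...download and clean... })
```

Because the builder runs only on a cache miss, re-rendering the document never triggers a fresh download once the RDS files exist.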

Troubleshooting

Section Overview

What you’ll learn: How to handle encoding issues and texts that are not found in the catalogue

Encoding Issues

Some older texts are encoded in Latin-1 rather than UTF-8. Read as UTF-8, their accented letters appear garbled (for example, é may show up as a replacement character):

Code
# Fix garbled characters by re-encoding from Latin-1 to UTF-8
# (text_df stands for a downloaded data frame with a `text` column)
text_fixed <- text_df |>
  dplyr::mutate(
    text = iconv(text, from = "latin1", to = "UTF-8", sub = "byte")
  )

# For a single string: convert it from its declared encoding to UTF-8
clean_line <- enc2utf8(text_df$text[1])
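Before re-encoding blindly, check whether a problem actually exists. Base R's `validUTF8()` flags strings whose bytes are not valid UTF-8, which is exactly what raw Latin-1 accented characters look like:

Code
```r
# One Latin-1 byte string and one proper UTF-8 string
x <- c("caf\xe9", "caf\u00e9")
validUTF8(x)        # FALSE TRUE: only the first line needs repair
sum(!validUTF8(x))  # number of lines to re-encode
iconv(x[1], from = "latin1", to = "UTF-8")
```

If the count is zero, the text is already valid UTF-8, and re-encoding it from Latin-1 would itself introduce mojibake.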

Text Not Found

If gutenberg_works() returns zero rows or gutenberg_safe() fails:

Code
# Problem 1: exact title match fails
# Solution: partial, case-insensitive search
gutenberg_works(
  stringr::str_detect(stringr::str_to_lower(title), "romeo")
)

# Problem 2: text has no downloadable plain text version
# Solution: check has_text == TRUE
gutenberg_metadata |>
  dplyr::filter(title == "Romeo and Juliet", has_text == TRUE)

# Problem 3: check rights status
gutenberg_metadata |>
  dplyr::filter(title == "Romeo and Juliet") |>
  dplyr::select(gutenberg_id, title, rights, has_text)
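If partial matching still returns nothing, the catalogue title may simply be spelled slightly differently. Base R's `agrepl()` allows approximate matching within a bounded edit distance; the sketch below uses a toy title vector, but the same call works on `gutenberg_metadata$title`:

Code
```r
# Approximate title match: up to 2 insertions/deletions/substitutions
titles <- c("Wuthering Heights", "Romeo and Juliet", "Jane Eyre")
agrepl("Wuthering Hights", titles, max.distance = 2)  # TRUE FALSE FALSE
```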

Verifying a Download

Code
# Reusable function to quickly check a downloaded text data frame
verify_download <- function(text_df, min_lines = 100) {
  cat("--- Download Verification ---\n")
  cat("Rows:", nrow(text_df), "\n")
  cat("Columns:", paste(names(text_df), collapse = ", "), "\n")
  cat("Empty lines:", sum(is.na(text_df$text) | text_df$text == ""), "\n")
  if ("title"  %in% names(text_df)) cat("Title:",  unique(text_df$title),  "\n")
  if ("author" %in% names(text_df)) cat("Author:", unique(text_df$author), "\n")
  if (nrow(text_df) < min_lines) warning("Download seems very short — check for errors")
  cat("First non-empty line:",
      text_df$text[which(nzchar(text_df$text))[1]], "\n")
}

verify_download(romeo, min_lines = 500)
--- Download Verification ---
Rows: 5647 
Columns: gutenberg_id, text, title 
Empty lines: 1194 
Title: Romeo and Juliet 
First non-empty line: The Project Gutenberg eBook of Romeo and Juliet 
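Note that the first non-empty line in the output above is still the Project Gutenberg header, a sign that boilerplate removal is still pending. A small check like the following (a sketch; `has_boilerplate()` is a hypothetical helper) can flag that automatically:

Code
```r
# TRUE if the opening lines still contain Project Gutenberg boilerplate
has_boilerplate <- function(lines) {
  head_lines <- lines[seq_len(min(20, length(lines)))]
  any(grepl("Project Gutenberg", head_lines, fixed = TRUE))
}

has_boilerplate(c("The Project Gutenberg eBook of Romeo and Juliet", "..."))
```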

AI Statement

This how-to guide was substantially revised and expanded from the original LADAL draft (gutenberg.qmd) with the assistance of Claude (Anthropic), an AI language model. The AI was used to: restructure the guide into a logical sequence of sections; add the gutenberg_safe() helper function (applying the mirror-loop + direct-URL fallback pattern consistently across all download calls, replacing the original gutenberg_download() pipe approach that returned empty results); expand filtering coverage to include bookshelf filtering, multi-criteria filtering, and partial name matching; add the cleaning and preparation section (boilerplate removal, splitting into acts/chapters, saving/loading); add the troubleshooting section; add the language frequency bar plot and metadata fields table; convert all formatting to Quarto callouts and LADAL flextable style; and update the YAML and citation. All content and workflow decisions were reviewed by the tutorial author.


Citation & Session Info

Schweinberger, Martin. 2026. Downloading Texts from Project Gutenberg using R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/gutenberg/gutenberg.html (Version 2026.02.24).

@manual{schweinberger2026gb,
  author       = {Schweinberger, Martin},
  title        = {Downloading Texts from Project Gutenberg using R},
  note         = {https://ladal.edu.au/tutorials/gutenberg/gutenberg.html},
  year         = {2026},
  organization = {The Language Technology and Data Analysis Laboratory (LADAL)},
  address      = {Brisbane},
  edition      = {2026.02.24}
}
Code
sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] flextable_0.9.7  DT_0.33          gutenbergr_0.2.4 lubridate_1.9.4 
 [5] forcats_1.0.0    stringr_1.5.1    dplyr_1.1.4      purrr_1.0.4     
 [9] readr_2.1.5      tidyr_1.3.1      tibble_3.2.1     ggplot2_3.5.1   
[13] tidyverse_2.0.0 

loaded via a namespace (and not attached):
 [1] gtable_0.3.6            xfun_0.51               bslib_0.9.0            
 [4] htmlwidgets_1.6.4       tzdb_0.4.0              vctrs_0.6.5            
 [7] tools_4.4.2             crosstalk_1.2.1         generics_0.1.3         
[10] curl_6.2.1              parallel_4.4.2          klippy_0.0.0.9500      
[13] pkgconfig_2.0.3         data.table_1.17.0       assertthat_0.2.1       
[16] uuid_1.2-1              lifecycle_1.0.4         compiler_4.4.2         
[19] textshaping_1.0.0       munsell_0.5.1           codetools_0.2-20       
[22] fontquiver_0.2.1        fontLiberation_0.1.0    htmltools_0.5.8.1      
[25] sass_0.4.9              lazyeval_0.2.2          yaml_2.3.10            
[28] crayon_1.5.3            pillar_1.10.1           jquerylib_0.1.4        
[31] openssl_2.3.2           cachem_1.1.0            fontBitstreamVera_0.1.1
[34] tidyselect_1.2.1        zip_2.3.2               digest_0.6.37          
[37] stringi_1.8.4           fastmap_1.2.0           grid_4.4.2             
[40] colorspace_2.1-1        cli_3.6.4               magrittr_2.0.3         
[43] triebeard_0.4.1         utf8_1.2.4              withr_3.0.2            
[46] gdtools_0.4.1           scales_1.3.0            bit64_4.6.0-1          
[49] timechange_0.3.0        rmarkdown_2.29          officer_0.6.7          
[52] bit_4.5.0.1             askpass_1.2.1           ragg_1.3.3             
[55] hms_1.1.3               evaluate_1.0.3          knitr_1.49             
[58] urltools_1.7.3          rlang_1.1.5             Rcpp_1.0.14            
[61] glue_1.8.0              xml2_1.3.6              renv_1.1.1             
[64] vroom_1.6.5             rstudioapi_0.17.1       jsonlite_1.9.0         
[67] R6_2.6.1                systemfonts_1.2.1      
